%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
from IPython.display import Image
Housing_Violation = pd.read_csv("~/Desktop/Khanooja_Gurtej_Assignment_1/Part_A/Housing_Code_Violations.csv")
# Taking a quick look at the first few rows of the dataset. If the data is labelled properly,
# the headers should give a good idea of what each column means
Housing_Violation.head()
Some inferences from the above result are as follows:
#Creating a copy of the dataset; all manipulations will be done on the copy
Housing_Violation_Copy = Housing_Violation.copy()
#Replacing spaces in column names with underscores
Housing_Violation_Copy = Housing_Violation_Copy.rename(columns={"Code Number": "Code_Number",
    "Code Description": "Code_Description", "Date Cited": "Date_Cited", "Date Corrected": "Date_Corrected",
    "Violation Status": "Violation_Status", "Case Number": "Case_Number", "Code Case Status": "Code_Case_Status"})
Housing_Violation_Copy.shape
From the above we can see that the dimensions of our dataset are 6689 x 9. The 9 attributes in the dataset and their explanations are listed above.
#Removing newline characters from the Address and stripping latitude and longitude out of it
Address = Housing_Violation_Copy["Address"].str.split("\n").str[0] + " " +Housing_Violation_Copy["Address"].str.split("\n").str[1]
Address.head()
#Saving latitude and longitude in a separate variable
Coordinates = Housing_Violation_Copy["Address"].str.split("\n").str[2]
Coordinates.head()
#Replacing the old address with the new one, without "\n" and stripped of latitude and longitude
Housing_Violation_Copy["Address"] = Address
#Creating a new column for latitude and longitude
Housing_Violation_Copy["Coordinates"] = Coordinates
#This is how the transformation in dataset looks
Housing_Violation_Copy
Housing_Violation_Copy.info()
From the above we see the datatype and the number of non-null values for each column. It looks like Parcel has some null values. Let's explore it further.
Housing_Violation_Copy.isnull().sum()
From the above we see that Parcel has 153 missing values in the form of nulls. Let's check whether there are more missing values in the dataset; the most common alternative form of missingness is 0s instead of nulls.
for i in Housing_Violation_Copy:
    print(sum(Housing_Violation_Copy[i] == 0))
From the above we see there are no missing values in the form of 0s.
The dataset looks clean and ready for further exploratory analysis. Let's answer some questions before we move on:
We found 153 null values in the Parcel column, but the exact meaning of Parcel is not clear from the data source. Just by looking at the variable, it does not seem important for the analysis, so at this point we take no further steps for the missing data.
Yes, the addresses contained newline characters, which we stripped out. We also split off the latitude and longitude into a separate column; they might be useful in the analysis as a separate entity.
The only bad data was in the address field, which was dealt with as explained above. There are no missing values where imputation is required.
#Let's see if there are some same addresses which received multiple violations
Housing_Violation_Copy["Address"].value_counts()
It seems from the above result that multiple addresses have more than one violation. Let us explore how many such addresses exist and whether there is something unique about them.
#Converting the value-counts series to a dictionary
Housing_Violation_Dict = Housing_Violation_Copy["Address"].value_counts().to_dict()
#Finding out number of addresses where multiple Violations are Registered
Multiple_Violations = {}
for key, value in Housing_Violation_Dict.items():
    if value > 1:
        Multiple_Violations.update({key: value})
len(Multiple_Violations)
From the above we can see that there are 1375 addresses with more than one violation registered against them.
Single_Violation = {}
for key, value in Housing_Violation_Dict.items():
    if value == 1:
        Single_Violation.update({key: value})
len(Single_Violation)
From the above two pieces of code we can see that only 1106 addresses have a single violation registered, while 1375 addresses have more than one.
#Maximum number of violation against a single address listed in dataset
Max_Violation = max(Multiple_Violations, key = Multiple_Violations.get)
print("The maximum number of violations happened at " + Max_Violation + " and the number of violations was:")
print(Multiple_Violations[Max_Violation])
Housing_Violation_Copy["Address"].value_counts().plot()
Graphically, too, we can verify that more than 50 percent of addresses have more than one violation.
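As a cross-check on that claim, the share of addresses with more than one violation can be computed directly from the value counts rather than read off the plot. A minimal sketch on a hypothetical mini-sample (not the real data):

```python
import pandas as pd

# Hypothetical mini-sample standing in for Housing_Violation_Copy["Address"]
addresses = pd.Series([
    "10 Magazine St", "10 Magazine St", "5 Oak St",
    "7 Elm St", "7 Elm St", "7 Elm St",
])

counts = addresses.value_counts()
# Fraction of distinct addresses that appear more than once
share_multiple = (counts > 1).mean() * 100
print(round(share_multiple, 1))  # 66.7 on this toy sample
```

On the real column this would come out to 1375 / (1375 + 1106), roughly 55 percent, consistent with the figures above.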
#Let's take the address with most violations and see what's unusually wrong with it
Address_Most_Violation = Housing_Violation_Copy[Housing_Violation_Copy.Address == "10 Magazine St Cambridge, MA 02139"]
Address_Most_Violation
Address_Most_Violation.Code_Number.value_counts()
Address_Most_Violation.Code_Description.value_counts()
From a quick look through the above data, the address with the highest number of registered violations has 33. Some violations have been repeated several times, and there are 10 unique ones. The most common violation is 410.351, which is Owner's Inst. and Maint. Responsibilities. Most of the cases are closed, up to the most recent one.
The meaning of the Parcel attribute was not clear and was not explicitly documented with the data source. From the above analysis it can be seen that Parcel is the same for the same address, so it should be related to the address against which the violation is registered; exactly how is something we are not sure of. Let's explore further to see if we can find the relation.
Housing_Violation_Copy.Parcel.value_counts()
We can see that parcel code "76-126" has 55 counts, which is more than the count for the address with the most violations. So a unique parcel is not necessarily tied to a unique address. Let's see what else can be done.
Housing_Violation_Copy[Housing_Violation_Copy.Parcel == "76-126"]
For parcel "76-126" we can observe that all the addresses are on the same street, i.e. Harvard Street, so perhaps the parcel is shared by all addresses on a street rather than being unique per address.
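One way to probe this hypothesis directly is to count the distinct addresses per parcel; any parcel mapping to more than one distinct address cannot be unique per address. A sketch on toy data (the parcel codes and street names here are illustrative only, not taken from the file):

```python
import pandas as pd

# Toy frame mimicking the Parcel and Address columns (values are illustrative)
df = pd.DataFrame({
    "Parcel":  ["76-126", "76-126", "76-126", "12-001", "12-001"],
    "Address": ["100 Harvard St", "102 Harvard St", "100 Harvard St",
                "9 Pine St", "9 Pine St"],
})

# Distinct addresses per parcel; a value > 1 means a parcel spans several addresses
addresses_per_parcel = df.groupby("Parcel")["Address"].nunique()
print(addresses_per_parcel.to_dict())  # {'12-001': 1, '76-126': 2}
```

Running the same `groupby` on `Housing_Violation_Copy` would show which parcels span multiple addresses.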
Housing_Violation_Copy.Parcel.value_counts().plot()
The graph shows that a large number of parcels have only one violation, meaning those are the only houses on their street with a housing violation. The count increases gradually, with a sharp rise at the end: 55 is the highest count, for parcel 76-126 on Harvard Street, with 3 addresses sharing those 55 violations.
#Extracting the different code numbers and counting the occurrences of each
Housing_Violation_Copy.Code_Number.unique()
len(Housing_Violation_Copy.Code_Number.unique())
print("There are 54 different code numbers, as listed above. These are all the violations cited since" + " " +
Housing_Violation_Copy.Date_Cited.min() + ". There might be other violation codes as well, but we would not know, as they may never have been cited")
Housing_Violation_Code = Housing_Violation_Copy.Code_Number.value_counts()
Housing_Violation_Code
More than 65 percent of the violations fall under the first five code numbers. Let's explore the top reasons for the violations.
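That share can be checked with `nlargest` on the code-number counts; a sketch with made-up counts (not the real figures from the dataset):

```python
import pandas as pd

# Illustrative counts per code number (not the real figures)
counts = pd.Series({"410.351": 16, "410.500": 9, "410.550": 5,
                    "410.600": 4, "410.620": 3, "410.700": 2, "410.750": 1})

# Share of all violations covered by the five most frequent codes
top_five_share = counts.nlargest(5).sum() / counts.sum() * 100
print(round(top_five_share, 1))  # 92.5 on this toy series
```

Substituting `Housing_Violation_Copy.Code_Number.value_counts()` for the toy series would reproduce the 65-percent figure quoted above.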
#Exploring top reasons for the violations
Housing_Violation_Copy.Code_Description.value_counts()
Top_Five = Housing_Violation_Copy.Code_Description.value_counts()
print("So top 5 reasons for violations are:")
Top_Five.nlargest(5)
#Percentage of violations by top five violations
Top_Five_Violation = ((sum(Top_Five.nlargest(5))/sum(Top_Five))*100)
print(Top_Five_Violation)
#Distribution of Code Description and Code Number
sns.barplot(x=Housing_Violation_Copy.Code_Description.value_counts().index,
y=Housing_Violation_Copy.Code_Description.value_counts())
The above graph shows the distribution of code descriptions. From it, the most frequent reason for violation sits a little above the 1600 mark, which was 1607 to be exact from the analysis done above.
#Converting python object to datetime object
Date_Cited_Datetime = pd.to_datetime(Housing_Violation_Copy["Date_Cited"])
Date_Corrected_Datetime = pd.to_datetime(Housing_Violation_Copy["Date_Corrected"])
#Let's see when was the first violation cited
min(Date_Cited_Datetime)
#Let's see when was the most recent violation cited
max(Date_Cited_Datetime)
#Let's see when was the first violation corrected
min(Date_Corrected_Datetime)
#When was the last violation corrected
max(Date_Corrected_Datetime)
There are some interesting things that can be seen in the above analysis:
The first violation ever cited was in 2011, while the first violation ever corrected was in 2013. That is odd and has to be looked at in more detail to understand what exactly is happening.
The last violation cited and the last violation corrected can definitely be two different records, but the two timestamps differ by just 1 minute. We have to see what exactly is happening there.
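To dig into that 1-minute gap, the rows behind the extreme timestamps can be recovered with a boolean mask against the maximum; a sketch with made-up dates mirroring the pattern described above:

```python
import pandas as pd

# Toy frame; a boolean mask recovers the row behind the latest timestamp
df = pd.DataFrame({
    "Date_Cited":     pd.to_datetime(["2011-05-01 09:00", "2016-09-30 10:00"]),
    "Date_Corrected": pd.to_datetime(["2013-02-01 14:00", "2016-09-30 10:01"]),
})

latest = df[df["Date_Cited"] == df["Date_Cited"].max()]
gap = (latest["Date_Corrected"] - latest["Date_Cited"]).iloc[0]
print(gap)  # 0 days 00:01:00
```

On the real data, the same mask on `Date_Cited_Datetime` would show whether the last cited and last corrected timestamps belong to the same record.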
#Length of time period from the first citation till now
max(Date_Cited_Datetime) - min(Date_Cited_Datetime)
#Length of time period from the first corrected violation till now
max(Date_Corrected_Datetime) - min(Date_Corrected_Datetime)
The above difference between the cited and corrected time spans makes sense, as the first violation ever corrected was in 2013 while the first ever cited was in 2011.
#Stripping time from datetime in Date_Cited and Date_Corrected and putting them in a dataframe to do further analysis
Date_Cited = Housing_Violation_Copy["Date_Cited"].str.split(" ").str[0]
Date_Corrected = Housing_Violation_Copy["Date_Corrected"].str.split(" ").str[0]
Date_Cited.head()
Date_Corrected.head()
Date_Cited_Corrected = pd.concat([Date_Cited, Date_Corrected], axis=1)
Date_Cited_Corrected.head()
#Converting the dataframe to datetime object
DateCited_Datime = pd.to_datetime(Date_Cited_Corrected["Date_Cited"])
DateCited_Datime.head()
DateCorrected_Datime= pd.to_datetime(Date_Cited_Corrected["Date_Corrected"])
DateCorrected_Datime.head()
#Calculating the difference between the Date Corrected and Date Cited
Date_Diff = DateCorrected_Datime - DateCited_Datime
Date_Diff.head()
type(Date_Diff[1])
#Since each difference between the dates is a timedelta object, we have to convert it to integer to be able to plot the
#distribution
a = Date_Diff.dt.days
sns.distplot(a)
From the above distribution graph we can see some negative values for the difference between corrected date and cited date, which would mean the violation was corrected before it was cited, which is practically impossible. This indicates an error while entering that data. Let's find the erroneous rows and remove them before carrying the analysis forward.
#Finding the indexes of rows with a negative date difference
len_values = 0
index_values = []
for index, value in enumerate(a):
    if value < 0:
        print(index, value)
        len_values += 1
        index_values.append(index)
len_values
So there are 12 rows, at the indexes printed above, with negative differences. Let's look at this data and remove it from the dataset before carrying the analysis further.
index_values
for i in index_values:
    print(Housing_Violation_Copy.iloc[i])
All 12 of the above observations have a corrected date earlier than the citation date. One more thing to note is that the first 6 observations share the same mistake, a date difference of -71 days.
#Dropping the 12 rows corresponding to indexes having negative date differences
DateCited_Datime_New = DateCited_Datime.drop(DateCited_Datime.index[index_values])
DateCorrected_Datime_New = DateCorrected_Datime.drop(DateCorrected_Datime.index[index_values])
#Repeating the same procedure for plotting the distribution of the difference, after dropping the 12
#negative-difference values
Date_Diff_New = DateCorrected_Datime_New - DateCited_Datime_New
b = Date_Diff_New.dt.days
#Distribution plot for new values
sns.distplot(b)
#Summary statistics for difference in dates after dropping the misleading data
b.describe()
Conclusions from the distribution plot and summary statistics:
The distribution plot shows a roughly normal curve, with the small portion of the curve before 0 holding no values.
A large volume of the data is clustered around the 100 mark on the graph, which is cross-validated by the summary statistics: 75 percent of violation cases were closed within 118 days of being issued.
The mean time for a case to be closed is 203.1 days, which is clearly driven by outliers where cases take far longer to close. Values extend up to 1736 days, a clear sign that the mean is biased.
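Because the long tail drags the mean upward, the median is a more robust centre for this kind of distribution; a toy sketch with made-up closure times showing the effect:

```python
import pandas as pd

# Made-up closure times in days; one long-tail case inflates the mean
days = pd.Series([10, 30, 60, 90, 118, 118, 1736])

print(round(days.mean(), 1), days.median())  # 308.9 vs 90.0
```

On the real data, comparing `b.median()` with the 203.1-day mean would quantify the skew the same way.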
Housing_Violation_Copy.Violation_Status.value_counts()
sns.barplot(x=Housing_Violation_Copy.Violation_Status.value_counts().index,
y=Housing_Violation_Copy.Violation_Status.value_counts())
The above graph gives us the number of corrected cases versus cited cases: 80.17 percent of cases are corrected and 19.83 percent are still only cited.
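Those percentages can be read off directly with `value_counts(normalize=True)` instead of computing them by hand; a sketch on an illustrative status column (the real split was roughly 80/20):

```python
import pandas as pd

# Illustrative status column, not the real 6000+ rows
status = pd.Series(["Corrected"] * 8 + ["Cited"] * 2)

pct = status.value_counts(normalize=True).mul(100)
print(pct.to_dict())  # {'Corrected': 80.0, 'Cited': 20.0}
```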
Housing_Violation_Copy.Case_Number.value_counts()
There is something interesting happening here. From the initial look at the data, the Case_Number attribute seemed unique per violation, but on a second look the same case number is applied to multiple violations. Let's see what's happening here, starting with the case number that appears 28 times.
Housing_Violation_Copy[Housing_Violation_Copy["Case_Number"] =="HOU-003259-2014"]
Housing_Violation_Copy[Housing_Violation_Copy["Case_Number"] =="HOU-009675-2015"]
We observed the individual rows for two case numbers (HOU-003259-2014 and HOU-009675-2015) that each have multiple values in the column above. A case number is not necessarily tied to a single address, but one observation is that all addresses sharing a case number belong to the same street. Another feature common to all rows with the same case number is that they share the same citation date and correction date. Some inferences could be drawn from this, but their validity cannot necessarily be proved from the data; invalidity can be, though.
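The shared-date observation can be verified wholesale rather than case by case: if every case number maps to exactly one citation date, the per-group `nunique` is 1 throughout. A sketch on toy rows (case numbers and dates are made up):

```python
import pandas as pd

# Toy rows sharing a case number (identifiers and dates are illustrative)
df = pd.DataFrame({
    "Case_Number": ["HOU-1", "HOU-1", "HOU-1", "HOU-2"],
    "Date_Cited":  ["2014-01-02", "2014-01-02", "2014-01-02", "2015-06-01"],
})

# True if every case number maps to exactly one citation date
print((df.groupby("Case_Number")["Date_Cited"].nunique() == 1).all())  # True
```

On the real frame the check would be `(Housing_Violation_Copy.groupby("Case_Number")["Date_Cited"].nunique() == 1).all()`.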
Housing_Violation_Copy.Code_Case_Status.value_counts()
sns.barplot(x=Housing_Violation_Copy.Code_Case_Status.value_counts().index,
y=Housing_Violation_Copy.Code_Case_Status.value_counts())
The Code Case Status attribute groups cases into three categories: closed, active, and court. Closed cases are ones that have been dealt with in court and had a result announced; active cases have been cited but are not in court; court cases have active hearings in court. Out of 6688 total cases, 6008 (almost 90 percent) have been closed, which means the court is doing a good job dealing with the violations. For a closer look, we could examine the cases of the last two months, see what percentage falls into each category, and, if a case has been active for a long time, analyze it individually to see where the exact problem is.
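The last suggestion, looking only at the most recent two months of cases, amounts to filtering on a date cutoff; a sketch with made-up dates (`pd.DateOffset` handles the month arithmetic):

```python
import pandas as pd

# Toy cited dates; keep only records within two months of the latest one
cited = pd.to_datetime(pd.Series(["2016-01-05", "2016-03-20", "2016-04-10"]))
cutoff = cited.max() - pd.DateOffset(months=2)

recent = cited[cited > cutoff]
print(len(recent))  # 2
```

Applying the same cutoff to `Date_Cited_Datetime` and then taking `value_counts()` of the status column would give the recent per-category split.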
Since I am not doing the clustering with the latitude/longitude data, as stated above, I got a clustered version from the Cambridge open data website; the .png version is included below to show how the violation addresses are distributed all over the city. The area near Harvard University seems to have the fewest violations. Lots of violations are concentrated near Bishop Allen Dr.
Image(filename=("/Users/Gurtej/Desktop/Map.png"), )
Note: there is no real scope for drawing correlation graphs, given the kinds of variables in the dataset.
----------------------------------------------THE_END-----------------------------------------------------------------